The result on the previous page is no accident: when nothing is known about the true model (so all models must be assumed equally likely), every algorithm has an expected error rate of exactly $1/2$, no better than random guessing. This is the no free lunch (NFL) theorem.
Let the dataset be $D \in (\Xcal \times \{0,1\})^m$, where $\Xcal$ is discrete, and let $p(f \mid A, D)$ be the probability that algorithm $A$ produces model $f$ based on $D$
Let the model class be $\Gcal = \{ \Xcal \mapsto \{0,1\} \}$; clearly $|\Gcal| = 2^{|\Xcal|}$
Given the true model $g \in \Gcal$, the expected prediction error of algorithm $A$ is
$$ \begin{align*} \quad E (A \mid D, g) = \Ebb_\xv \int_f \Ibb(f(\xv) \ne g(\xv)) \cdot p(f \mid A, D) ~ \diff f \end{align*} $$
In machine learning theory, the model class $\Gcal$ is called the hypothesis space.
If every model in $\Gcal$ is equally likely to be the true model, each with probability $1 / |\Gcal|$, then the expected prediction error of algorithm $A$, averaged over $g$, is
$$ \begin{align*} \quad \sum_g \frac{E (A \mid D, g)}{|\Gcal|} & = \sum_g \frac{1}{|\Gcal|} \Ebb_\xv \int_f \Ibb(f(\xv) \ne g(\xv)) \cdot p(f \mid A, D) ~ \diff f \\ & = \Ebb_\xv \int_f p(f \mid A, D) \frac{1}{|\Gcal|} \underbrace{\sum_g \Ibb(f(\xv) \ne g(\xv))}_{|\Gcal| / 2} ~ \diff f \\ & = \Ebb_\xv \int_f \frac{p(f \mid A, D)}{2} ~ \diff f = \Ebb_\xv \frac{1}{2} = \frac{1}{2} \end{align*} $$
The second line holds because $\Xcal$ is discrete: for any $\xv \in \Xcal$, exactly half of the $2^{|\Xcal|}$ models in $\Gcal$ predict $\xv$ as positive and half as negative, so the inner sum over $g$ equals $|\Gcal| / 2$
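The averaging argument above can be checked by brute force on a tiny discrete $\Xcal$ (a stdlib-only sketch; the 3-point input space is an arbitrary illustrative choice):

```python
from itertools import product

# Tiny discrete input space with 3 points; the hypothesis space is all
# 2^3 = 8 Boolean functions, each represented as a tuple of predictions.
X = range(3)
models = list(product([0, 1], repeat=3))

# Whatever fixed model f a (deterministic) algorithm outputs, its error
# averaged uniformly over all targets g and all inputs x is exactly 1/2.
for f in models:
    mismatches = sum(f[x] != g[x] for g in models for x in X)
    assert mismatches / (len(models) * len(X)) == 0.5
print("every model averages error 1/2 over all targets")
```

At each input $\xv$, half of the targets disagree with $f(\xv)$, which is exactly the counting step in the derivation above.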
The lesson of the NFL theorem: talking about which algorithm is "better" in the abstract, detached from a concrete task, is meaningless!
Analogy: driving a car vs. riding an e-bike: which is better depends on the trip
The NFL theorem assumes all $g$ are equally likely. In practice, however, the available data already make some models in the hypothesis space more plausible as the true model than others, so a learning algorithm's own {==bias==} should match the task (i.e., the data)
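The point about matching bias to task can be made concrete with a toy experiment (a stdlib-only sketch; the 3-point input space, the "target is constant" prior, and the two learners `learn_match` / `learn_mismatch` are all illustrative assumptions, not from the text):

```python
from itertools import product

X = range(3)
train_x = [0]        # D observes the label at x = 0 only
test_x = [1, 2]      # generalization is measured off-sample

def learn_match(labels):
    # inductive bias: "the target is constant" -- extrapolate the seen label
    return tuple([labels[0]] * len(X))

def learn_mismatch(labels):
    # the opposite bias: extrapolate the flipped label
    return tuple([1 - labels[0]] * len(X))

def avg_error(learner, targets):
    # average off-sample error, uniform over the given family of targets
    total = 0
    for g in targets:
        f = learner([g[x] for x in train_x])
        total += sum(f[x] != g[x] for x in test_x)
    return total / (len(targets) * len(test_x))

constants = [(0, 0, 0), (1, 1, 1)]            # prior: target is constant
all_targets = list(product([0, 1], repeat=3))  # no prior: all 8 targets

print(avg_error(learn_match, constants))      # 0.0  bias matches the task
print(avg_error(learn_mismatch, constants))   # 1.0  bias contradicts it
print(avg_error(learn_match, all_targets))    # 0.5  NFL: no prior, no edge
print(avg_error(learn_mismatch, all_targets)) # 0.5
```

When the prior knowledge "the target is constant" is correct, the matched learner is perfect and the mismatched one is always wrong; with no prior restriction on targets, both average $1/2$, exactly as the NFL theorem predicts.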
Given a model $f$ and a dataset $D = \{ (\xv_i, y_i) \}_{i \in [m]}$, a natural performance measure for regression is the mean squared error $E(f; D) = \frac{1}{m} \sum_{i \in [m]} (f(\xv_i) - y_i)^2$
```python
from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mean_squared_error(y_true, y_pred))                 # mean squared error
# 0.375
print(mean_squared_error(y_true, y_pred, squared=False))  # root mean squared error
# 0.6123724356957945
```
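The same two numbers can be reproduced from the definition with the standard library alone (a minimal sketch; the helper `mse` is hypothetical, not part of scikit-learn):

```python
import math

def mse(y_true, y_pred):
    # mean of the squared residuals, straight from the definition
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print(mse(y_true, y_pred))             # 0.375
print(math.sqrt(mse(y_true, y_pred)))  # 0.6123724356957945
```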
Raw data: tables, images, video, text, speech, ...
Model learning: the core step, learning a mapping used for prediction
Feature engineering: turning raw data into feature representations a learner can use (e.g., the 30 numeric features of the breast-cancer dataset below)
```python
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.DESCR)
# --------------------
# **Data Set Characteristics:**
#
# :Number of Instances: 569
#
# :Number of Attributes: 30 numeric, predictive attributes and the class
#
# :Attribute Information:
#     - radius (mean of distances from center to points on the perimeter)
#     - texture (standard deviation of gray-scale values)
#     - perimeter
#     - area
#     - smoothness (local variation in radius lengths)
#     - compactness (perimeter^2 / area - 1.0)
#     - concavity (severity of concave portions of the contour)
#     - concave points (number of concave portions of the contour)
#     - symmetry
#     - fractal dimension ("coastline approximation" - 1)
#
# The mean, standard error, and "worst" or largest (mean of the three
# worst/largest values) of these features were computed for each image,
# resulting in 30 features. For instance, field 0 is Mean Radius, field
# 10 is Radius SE, field 20 is Worst Radius.
#
#     - class:
#         - WDBC-Malignant
#         - WDBC-Benign
#
# :Summary Statistics:
#
# ===================================== ====== ======
#                                        Min    Max
# ===================================== ====== ======
# radius (mean):                        6.981  28.11
# texture (mean):                       9.71   39.28
# perimeter (mean):                     43.79  188.5
# area (mean):                          143.5  2501.0
# smoothness (mean):                    0.053  0.163
# compactness (mean):                   0.019  0.345
# concavity (mean):                     0.0    0.427
# concave points (mean):                0.0    0.201
# symmetry (mean):                      0.106  0.304
# fractal dimension (mean):             0.05   0.097
# radius (standard error):              0.112  2.873
# texture (standard error):             0.36   4.885
# perimeter (standard error):           0.757  21.98
# area (standard error):                6.802  542.2
# smoothness (standard error):          0.002  0.031
# compactness (standard error):         0.002  0.135
# concavity (standard error):           0.0    0.396
# concave points (standard error):      0.0    0.053
# symmetry (standard error):            0.008  0.079
# fractal dimension (standard error):   0.001  0.03
# radius (worst):                       7.93   36.04
# texture (worst):                      12.02  49.54
# perimeter (worst):                    50.41  251.2
# area (worst):                         185.2  4254.0
# smoothness (worst):                   0.071  0.223
# compactness (worst):                  0.027  1.058
# concavity (worst):                    0.0    1.252
# concave points (worst):               0.0    0.291
# symmetry (worst):                     0.156  0.664
# fractal dimension (worst):            0.055  0.208
# ===================================== ====== ======
#
# :Missing Attribute Values: None
#
# :Class Distribution: 212 - Malignant, 357 - Benign
#
# :Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
#
# :Donor: Nick Street
#
# :Date: November, 1995
#
# This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
# https://goo.gl/U2Uwz2
#
# Separating plane described above was obtained using
# Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
# Construction Via Linear Programming." Proceedings of the 4th
# Midwest Artificial Intelligence and Cognitive Science Society,
# pp. 97-101, 1992], a classification method which uses linear
# programming to construct a decision tree. Relevant features
# were selected using an exhaustive search in the space of 1-4
# features and 1-3 separating planes.
#
# The actual linear program used to obtain the separating plane
# in the 3-dimensional space is that described in:
# [K. P. Bennett and O. L. Mangasarian: "Robust Linear
# Programming Discrimination of Two Linearly Inseparable Sets",
# Optimization Methods and Software 1, 1992, 23-34].
#
# This database is also available through the UW CS ftp server:
#
# ftp ftp.cs.wisc.edu
# cd math-prog/cpo-dataset/machine-learn/WDBC/
#
# .. dropdown:: References
#
#   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
#     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
#     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
#     San Jose, CA, 1993.
#   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
#     prognosis via linear programming. Operations Research, 43(4), pages 570-577,
#     July-August 1995.
#   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
#     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
#     163-171.
```
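To tie the three stages together, here is a minimal end-to-end sketch on the dataset just loaded. The specific choices of `StandardScaler` as the feature-engineering step and `LogisticRegression` as the learner are illustrative assumptions, not prescribed by the text:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Raw data: the 30 numeric features and the binary (malignant/benign) label
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Feature engineering: standardize each feature using training statistics only
scaler = StandardScaler().fit(X_train)

# Model learning: fit a classifier on the transformed features
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out test split
acc = accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
print(f"test accuracy: {acc:.3f}")
```

Standardizing the features before fitting matters here because the raw columns differ in scale by several orders of magnitude (e.g., `area` vs. `fractal dimension`), which slows or prevents convergence of the solver.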